Abstract



In this paper, we consider the statistical analysis of a protein interaction network. We propose a 
Bayesian model that uses a hierarchy of probabilistic assumptions about the way proteins interact with 
one another in order to: (i) identify the number of non-observable functional modules; (ii) estimate the 
degree of membership of proteins to modules; and (iii) estimate typical interaction patterns among the 
functional modules themselves. Our model describes a large amount of (relational) data using a relatively 
small set of parameters that we can reliably estimate with an efficient inference algorithm. We apply our 
methodology to data on protein-to-protein interactions in Saccharomyces cerevisiae to reveal proteins' 
diverse functional roles. The case study provides the basis for an overview of which scientific questions 
can be addressed using our methods, and for a discussion of technical issues. 

Keywords: Data analysis; Bayesian inference; Latent Variables; Hierarchical mixture model; Variational 
Expectation-Maximization; Mean-field approximation; Relational data; Unipartite graphs. 

1 Introduction 

Relational data, which describe measurements on pairs of objects, arise in a variety of applications. Citation 
networks underlying scientific collections of papers are obtained from references, which connect pairs of 
papers; web-graphs are obtained from hyperlinks, which connect pairs of web-pages; protein networks are 
obtained from physical interaction records, which relate pairs of proteins. In this paper, the discussion 
develops intuitions for protein interaction networks obtained experimentally with yeast two-hybrid tests and 
other means. 

Address correspondence to: Edo Airoldi, Carl Icahn Laboratory, Princeton University, Princeton, NJ 08544, USA. 

There are important differences between models for relational data, such as protein-protein interactions, 
and non-relational data, such as protein attributes. Specifically, the exchangeability assumptions underlying 
models of non-relational data are typically violated by relational data (Airoldi et al., 2007). Descriptive data 
analyses of relational measurements consider a rich set of goals, which often include: (i) the identification 
of the number of non-observable groups of objects, e.g., functional modules or stable protein complexes; 
(ii) the estimation of the degree of membership of objects to groups, e.g., of proteins to functional modules 
or protein complexes; and (iii) the estimation of typical interaction patterns among the groups themselves, 
e.g., among functional modules or protein complexes. While the first two tasks arise in non-relational data 
settings as well, the last task is specific to relational data settings. In addition to these descriptive goals 
we are often interested in inferring latent quantities that are useful for making predictions. In the context 
of protein interaction networks, we want to identify group memberships and interaction patterns that are 
instrumental in predicting new relations and object specific attributes; more specifically, one may try to 
predict interactions between pairs of proteins and individual proteins' functional annotations, using patterns 
of interaction between them, and between the stable protein complexes they belong to. 

1.1 Novel Contributions 

In this paper, we propose the Admixture of Latent Blocks (ALB), where proteins exhibit membership in 
multiple latent groups. We develop efficient posterior inference algorithms for discovering the membership 
of proteins to groups from large collections of observed protein-protein interaction data. 

In the context of protein interaction networks, mixed membership relaxes the mixture modeling assump- 
tion that each protein belongs to a single group. ALB uses the mixed-membership of the proteins to explain 
interactions measured between them. Specifically, a latent stochastic block structure allows us to model 
interaction patterns among the groups by encoding the probabilities according to which pairs of individual 
proteins interact as generic members of the corresponding pairs of groups. 

We develop an efficient inference algorithm based on variational methods. This provides a fast alternative 
to MCMC, and allows us to analyze the large collections of relational data that arise in biological 
applications. 

We apply our methodology to a large protein interaction network to reveal proteins' diverse functional 
roles. The case study in Section 4 illustrates the scientific questions that can be addressed with our model, 
the alternative inference and estimation strategies, and the technical issues that arise in such analyses. 

1.2 Related Work 

There is a history of probabilistic models for relational data analysis in Statistics. Part of this literature 
is rooted in the stochastic block modeling ideas from psychometrics and sociology. This model is due to 
Holland and Leinhardt (1975), and was later elaborated upon by others (see, e.g., Fienberg et al., 1985; 
Wasserman and Pattison, 1996; Snijders, 2002). In machine learning, Markov random networks have been 
used for link prediction (Taskar et al., 2003) and the traditional block models from Statistics have been 
extended with nonparametric Bayesian priors (Kemp et al., 2004, 2006). 

Mixed membership models for clustering have emerged as a powerful and popular analytical tool for 
analyzing large databases involving text (Hofmann, 1999; Blei et al., 2003), text and references (Cohn and 
Hofmann, 2001; Erosheva et al., 2004), text and images (Barnard et al., 2003), multiple disability measures 
(Erosheva and Fienberg, 2005; Manton et al., 1994), and genetics information (Rosenberg et al., 2002; 
Pritchard et al., 2000; Xing et al., 2003). These models use a simple generative model, such as bag-of- 
words or naive Bayes, embedded in a larger hierarchical model that involves a latent variable structure. This 
induces dependencies between the observed data, and introduces statistical control over the estimation of 
what might otherwise be an extremely large set of parameters. 

2 The Scientific Problem 

Our goal is to analyze proteins' diverse functional roles by analyzing their local and global patterns of 
interaction. The biochemical composition of individual proteins makes them suitable for carrying out a 
specific set of cellular operations, or functions. Proteins typically carry out these functions as part of stable 
protein complexes (Krogan et al., 2006). There are many situations in which proteins are believed to interact 
(Alberts et al., 2002); the main intuition behind our methodology is that pairs of proteins interact because 
they are part of the same stable protein complex, i.e., co-location, or because they are part of interacting 
protein complexes as they carry out compatible cellular operations. 



2.1 Protein Interactions and Functional Annotations 



The Munich Information Center for Protein Sequences (MIPS) database was created in 1998 based on evidence 
derived from a variety of experimental techniques, but does not include information from high-throughput data 
sets (Mewes et al., 2004). It contains about 8000 protein complex associations in yeast. We analyze a subset 
of this collection containing 871 proteins, the interactions amongst which were hand-curated. MIPS 
also provides a set of functional annotations, alternative to the gene ontology (GO). These annotations are 
organized in a tree, with 15 general functions at the first level, 72 more specific functions at an intermediate 
level, and 255 annotations at the leaf level. In Table 1 we map the 871 proteins in our collection to 
the main functions of the MIPS annotation tree; proteins in our sub-collection have about 2.4 functional 
annotations on average. 2 



#   Category                          Count    #   Category                          Count
1   Metabolism                          125    9   Interaction w/ cell, environment     18
2   Energy                               56    10  Cellular regulation                  37
3   Cell cycle & DNA processing         162    11  Cellular other                       78
4   Transcription (tRNA)                258    12  Control of cell organization         36
5   Protein synthesis                   220    13  Sub-cellular activities             789
6   Protein fate                        170    14  Protein regulators                    1
7   Cellular transportation             122    15  Transport facilitation               41
8   Cell rescue, defence & virulence      6

Table 1: The 15 high-level functional categories obtained by cutting the MIPS annotation tree at the first 
level and how many proteins (among the 871 we consider) participate in each of them. Most proteins 
participate in more than one functional category, with an average of 2.4 functional annotations. 
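
The average quoted in the caption can be checked directly from the counts in Table 1. The short sketch below only re-adds the published counts and divides by the 871 proteins; the variable and function names are illustrative and do not come from the paper.

```python
# Counts per top-level MIPS category, copied from Table 1 (categories 1 to 15).
CATEGORY_COUNTS = {
    "Metabolism": 125,
    "Energy": 56,
    "Cell cycle & DNA processing": 162,
    "Transcription (tRNA)": 258,
    "Protein synthesis": 220,
    "Protein fate": 170,
    "Cellular transportation": 122,
    "Cell rescue, defence & virulence": 6,
    "Interaction w/ cell, environment": 18,
    "Cellular regulation": 37,
    "Cellular other": 78,
    "Control of cell organization": 36,
    "Sub-cellular activities": 789,
    "Protein regulators": 1,
    "Transport facilitation": 41,
}
N_PROTEINS = 871  # size of the hand-curated subset analyzed in the paper


def mean_annotations(counts, n_proteins):
    """Average number of top-level annotations per protein."""
    return sum(counts.values()) / n_proteins


total_annotations = sum(CATEGORY_COUNTS.values())
average = mean_annotations(CATEGORY_COUNTS, N_PROTEINS)
print(total_annotations, round(average, 1))
```

The counts add up to more than 871 because a protein annotated with several categories is counted once in each of them.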

By mapping proteins to the 15 general functions, we obtain a 15-dimensional representation for each 
protein. In Figure 1 each panel corresponds to a protein; the 15 functional categories are ordered as in Table 
1 on the X axis, whereas the presence or absence of the corresponding functional annotation is displayed 
on the Y axis. 

2 We note that the relative importance of functional categories in our sub-collection, in terms of the number of proteins involved, 
is different from the relative importance of functional categories over the entire MIPS collection. 

Figure 1: By mapping individual proteins to the 15 general functions in Table 1, we obtain a 15-dimensional 
representation for each protein. Here, each panel corresponds to a protein; the 15 functional categories are 
displayed on the X axis, whereas the presence or absence of the corresponding functional annotation is 
displayed on the Y axis. The plots at the bottom zoom into three example panels (proteins). 



3 The Admixture of Latent Blocks Model 



The admixture of latent blocks (ALB) models observed protein interaction networks, $G_{1:M} = (R_{1:M}, \mathcal{P})$. 
The presence or absence of a physical interaction among pairs of proteins $p, q \in \mathcal{P}$ is measured over $M$ 
distinct experimental conditions and encoded by Bernoulli random variables $R_m(p, q)$. Let us assume that 
we observe networks among $N := |\mathcal{P}|$ distinct proteins. 

Mixed membership analysis: Relational data 



Stable protein complexes are believed to play a major role in cellular processes (Krogan et al., 2006). As 
a consequence, protein interaction networks provide insights into individual proteins' functionality to the 
extent to which they carry information about the membership of individual proteins to stable protein com- 
plexes. In a complex biological system, many proteins are functionally versatile and distinct copies of the 
same protein participate in multiple protein complexes, which perform different cellular processes (i.e., 
functions) at different times or under different biological conditions. Thus, when modeling interactions as 
observable outcomes of latent functional processes, it is natural to adopt a flexible model which allows dis- 
tinct copies of a protein to interact with other proteins in multiple, functionally related biological contexts. 

For example, a signal transduction protein may sometimes interact with a cellular membrane protein 
as part of a signal receptor; at another time, it may interact with the transcription complex as an auxiliary 
transcription factor. Furthermore, there is direct empirical evidence that individual proteins may perform 
multiple functions while taking part in a cellular process, 3 and that they typically carry them out as members 
of stable protein complexes. 

The mixed membership assumption provides our model with such a desirable feature. Under this assumption, 
we introduce mixed membership vectors $\pi_{1:N}$, such that $\pi_{nk}$ is the probability according to 
which copies of the $n$-th protein participate in copies of the $k$-th protein complex, and $\sum_k \pi_{nk} = 1$. 
In the ALB model, mixed membership is a global feature of the behavior of a protein, i.e., it emerges from 
the composition of the collection of individual interactions a protein is involved in. Each of these individual 
interactions is characterized by a single membership of each of the two proteins involved to a pair of stable 
protein complexes. 

Such single memberships of pairs of proteins to pairs of stable protein complexes for one observed interaction, 
$R_m(p, q)$, are encoded in one corresponding pair of latent protein complex indicators, $(z^m_{p \to q}, z^m_{p \leftarrow q})$. 
These interaction-specific indicators induce flexibility in the model both at the protein level and at the 
interaction level. Specifically, they allow (i) distinct copies of the same protein to interact with other proteins as members 
of different stable protein complexes, e.g., under the same experimental conditions; and (ii) distinct 
interactions between copies of the same pair of proteins to be the expression of interactions of different pairs 
of stable protein complexes, e.g., under different experimental conditions. In the ALB model, each latent 
indicator $z^m_{p \to q}$ (respectively, $z^m_{p \leftarrow q}$) is a multinomial random vector with unitary size parameter, so that a 
single membership is sampled for each unit measurement, and with probabilities $\pi_p$ (respectively, $\pi_q$) to 
constrain such single memberships to follow the appropriate mixed membership profile of the $p$-th (respectively, $q$-th) protein 
to the $K$ protein complexes. 

3 As an example, see the data on functional annotations of about 850 proteins in yeast presented in Figure 1. The data were 
provided courtesy of the Munich Information Center for Protein Sequences (Mewes et al., 2004). 

During estimation and inference, the model recovers the non-observable stable protein complexes that 
are likely to carry out the functional processes underlying the data, in terms of the degree to which proteins 
in $\mathcal{P}$ take part in them, i.e., in terms of the mixed membership vectors $\pi_{1:N}$, by assessing the similarity of 
observed protein-to-protein interaction patterns. 

3.2 Latent Blocks 

In the experimental setting where measurements of physical interactions are taken, a certain number of 
stable protein complexes exist. The number of functions underlying the data, which are carried out by these 
complexes, as well as the interaction patterns among such complexes are also of primary interest. 

A scalar parameter, K, encodes the number of non-observable protein functions underlying the collection 
of observed interactions in the ALB model. Assuming that K distinct functions exist, a latent block structure 
B encodes the interaction patterns among the K distinct stable protein complexes which carry them out. 
The latent block structure is a table of size $K \times K$. The generic entry $B(g, h)$ in the table encodes the 
probability according to which the pair (p, q) of proteins interacts, whenever a copy of the p-th protein is a 
member of the g-th complex and a copy of the q-th protein is a member of the h-th complex. 

During estimation and inference, the model is fitted assuming that a pre-specified number of functions K 
underlies the data, and recovers the probabilities according to which pairs of individual proteins interact as 
generic members of the corresponding pairs of stable protein complexes, i.e., the interaction patterns among 
stable protein complexes B. 

Figure 2: A graphical representation of the admixture of latent blocks (ALB) model. The boxes are plates 
and represent replicates of observed networks. White nodes represent latent variables, whereas shaded (gray) 
nodes represent observed interactions. 

3.3 The Data Generating Process 

The data generating process for $M$ protein interaction networks, $G_{1:M}$, assumes that the number of individual 
proteins, $N := |\mathcal{P}|$, the number of protein complexes, $K$, their interaction patterns, $B$, and the average 
mixed membership of proteins to functions, $\alpha$, are given a priori. 

The process then posits that, within the $m$-th network, each observed interaction $R_m(p, q)$, $p, q \in \mathcal{P}$, is 
a Bernoulli random variable 4 with probability of success $\sigma^m_{pq}$. The single memberships, $(z^m_{p \to q}, z^m_{p \leftarrow q})$, and 
the latent block structure, $B$, provide competing explanations of the (scalar) probability of success of each 
observed interaction, $R_m(p, q)$, which is defined as follows, 

$$\sigma^m_{pq} = z^{m\top}_{p \to q} \, B \, z^m_{p \leftarrow q}.$$

4 Note that the data generating process does not necessarily generate symmetric interactions, i.e., interactions with $R_m(p, q) = R_m(q, p)$, thus 
increasing the applicability of our methodology to relational data that arise in other domains. 

The collection of single membership indicators of a protein $p$, $\{z^m_{p \to q} : q \in \mathcal{P}, m = 1, \ldots, M\}$, each a 
multinomial random vector with parameters $(\pi_p, 1)$, is constrained by the global mixed membership behavior 
encoded in $\pi_p$. The mixed memberships and the latent block structure provide competing explanations 
of the average probability of success of an interaction, given by $\sigma_{pq} = \pi_p^\top B \, \pi_q$, across the $M$ experimental 
conditions. The mixed membership vectors are further constrained by positing they are (non-observable, 
independent and identically distributed) samples from a common Dirichlet distribution with hyper-parameter 
$\alpha$. 
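
To make the average probability of success concrete, the following minimal sketch evaluates the bilinear form $\pi_p^\top B \, \pi_q$ for a toy problem with $K = 2$ complexes. The numerical values and the function name are hypothetical illustrations, not estimates from the paper.

```python
def interaction_probability(pi_p, B, pi_q):
    """Return pi_p^T B pi_q for membership vectors pi_p, pi_q and a K x K block matrix B."""
    K = len(B)
    return sum(pi_p[g] * B[g][h] * pi_q[h] for g in range(K) for h in range(K))


# Hypothetical toy values with K = 2 protein complexes.
B = [[0.9, 0.1],
     [0.1, 0.8]]
pi_p = [0.5, 0.5]   # protein p splits its membership evenly
pi_q = [1.0, 0.0]   # protein q belongs to complex 1 only

sigma_pq = interaction_probability(pi_p, B, pi_q)
print(sigma_pq)
```

When both proteins have all of their mass on a single complex, the expression reduces to one entry of $B$; mixed membership averages over the entries, weighted by the two membership vectors.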

The data generating process for $G_{1:M} = (R_{1:M}, \mathcal{P})$ maps a small set of constants to the data, 
$\mathrm{DGP} : (\mathcal{P}, M, K, \alpha, B) \to R_{1:M}$, and it is instantiated as follows. 

1. For each protein $p \in \mathcal{P}$ 

1.1. Sample $\pi_p \sim \mathrm{Dirichlet}(\alpha)$. 

2. For each protein interaction network $m = 1, \ldots, M$ 

2.1. For each pair of proteins $(p, q) \in \mathcal{P} \otimes \mathcal{P}$ 

2.1.1. Sample group $z^m_{p \to q} \sim \mathrm{Multinomial}(\pi_p, 1)$ 

2.1.2. Sample group $z^m_{p \leftarrow q} \sim \mathrm{Multinomial}(\pi_q, 1)$ 

2.1.3. Sample $R_m(p, q) \sim \mathrm{Bernoulli}(z^{m\top}_{p \to q} B \, z^m_{p \leftarrow q})$ 
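
The steps above can be sketched in a few lines of Python using only the standard library. This is an illustrative simulation under stated assumptions, not the authors' implementation: the function names, the toy values of $\alpha$ and $B$, and the choice to draw the Dirichlet by normalizing Gamma variates are all ours. Each one-hot indicator is represented by the index of its non-zero entry, so the Bernoulli probability in step 2.1.3 is simply the entry $B(g, h)$.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_indicator(pi, rng):
    """Draw one complex index from Multinomial(pi, 1)."""
    return rng.choices(range(len(pi)), weights=pi, k=1)[0]


def generate_networks(n_proteins, n_networks, alpha, B, seed=0):
    """Simulate mixed membership vectors and M binary interaction matrices."""
    rng = random.Random(seed)
    # Step 1: one mixed membership vector per protein.
    pi = [sample_dirichlet(alpha, rng) for _ in range(n_proteins)]
    networks = []
    for _ in range(n_networks):                      # Step 2
        R = [[0] * n_proteins for _ in range(n_proteins)]
        for p in range(n_proteins):                  # Step 2.1: all ordered pairs
            for q in range(n_proteins):
                g = sample_indicator(pi[p], rng)     # Step 2.1.1
                h = sample_indicator(pi[q], rng)     # Step 2.1.2
                R[p][q] = 1 if rng.random() < B[g][h] else 0  # Step 2.1.3
        networks.append(R)
    return pi, networks


# Hypothetical toy configuration: 6 proteins, 2 networks, K = 3 complexes.
alpha = [0.5, 0.5, 0.5]
B = [[0.90, 0.05, 0.05],
     [0.05, 0.90, 0.05],
     [0.05, 0.05, 0.90]]
pi, networks = generate_networks(6, 2, alpha, B, seed=1)
```

Because the indicators are redrawn for every ordered pair and every network, the same protein can act as a member of different complexes in different interactions, which is the flexibility the mixed membership assumption is meant to provide.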

The process above suggests a hierarchical decomposition of the joint probability distribution of the observations, 
$R_{1:M}$, and the latent variables, 5 $(\pi_{1:N}, Z^{\to}_{1:M}, Z^{\leftarrow}_{1:M})$; that is, the integrand in Equation 1. By 
integrating the latent variables out of the joint we obtain the likelihood of the observations, 

$$
\begin{aligned}
p(R_{1:M} \mid \alpha, B) &= \int_{\Pi \otimes Z} p(R_{1:M}, \pi_{1:N}, Z^{\to}_{1:M}, Z^{\leftarrow}_{1:M} \mid \alpha, B) \, d\pi \, dZ \\
&= \int_{\Pi \otimes Z} \prod_{m} \prod_{p,q} p_1(R_m(p, q) \mid z^m_{p \to q}, z^m_{p \leftarrow q}, B) \; p_2(z^m_{p \to q} \mid \pi_p, 1) \\
&\qquad \times \; p_2(z^m_{p \leftarrow q} \mid \pi_q, 1) \; \prod_{p} p_3(\pi_p \mid \alpha) \, d\pi \, dZ,
\end{aligned}
\tag{1}
$$

where $p_1$ is Bernoulli, $p_2$ is multinomial, and $p_3$ is Dirichlet. A graphical representation using plates of the 
statistical model corresponding to the data generating process is given in Figure 2. 

5 Two sets of latent protein complex indicators corresponding to the $m$-th protein network are denoted compactly by 
$\{z^m_{p \to q} : p, q \in \mathcal{P}\} =: Z^{\to}_m$ and $\{z^m_{p \leftarrow q} : p, q \in \mathcal{P}\} =: Z^{\leftarrow}_m$. 

A recurring question, which bears relevance to mixed membership models in general, is why one does 
not necessarily want to integrate out the single membership indicators, $(z^m_{p \to q}, z^m_{p \leftarrow q})$, in the specifications 
above. There are some computational aspects to this, but a practical issue that argues against such marginalization 
is that we would often lose interpretable quantities that are useful for making predictions, for 
de-noising new measurements, or for performing other tasks. In fact, the posterior distributions of such quantities 
typically carry substantive information about elements of the application at hand. In the application to 
protein interaction networks, for example, they encode the interaction-specific memberships of individual 
proteins to protein complexes. 

3.4 Modeling Rare Interactions 

The specification of the data generating process suggests that observations about interactions and 
non-interactions are equally important in terms of their contributions to model fit, e.g., see the integrand 
in Equation 1. In other words, they equally compete for a likely explanation in terms of estimates for 
$(\alpha, B, \pi_{1:N})$. As a consequence, in experimental settings where interactions are rare, the estimation and 
inference tasks will find hyper-parameter values and posterior distributions that explain patterns of 
non-interaction rather than patterns of interaction. 

In order to be able to calibrate the importance of rare interactions, we introduce the sparsity parameter 
$\rho \in [0, 1]$, which models how often a non-interaction is due to noise and how often it carries information 
about proteins' memberships to protein complexes. This leads to a new generative process, where the 
non-interactions are generated from a mixture between the original Bernoulli (step 2.1.3. in the data generating 
process) and a point mass at zero, with weights $(1 - \rho)$ and $\rho$ respectively. The probabilities of 
non-interactions are set to, 

$$1 - \sigma^m_{pq} = (1 - \rho) \cdot z^{m\top}_{p \to q} (1 - B) \, z^m_{p \leftarrow q} + \rho,$$

and the probabilities of successful interactions in the data generating process become, 

$$\sigma^m_{pq} = (1 - \rho) \, z^{m\top}_{p \to q} B \, z^m_{p \leftarrow q} = z^{m\top}_{p \to q} B' \, z^m_{p \leftarrow q},$$

where $B'(g, h) = (1 - \rho) \cdot B(g, h)$ for $g, h = 1, \ldots, K$. 

During estimation and inference, a large value of $\rho$ leads to interactions in the matrix being weighted 
more than non-interactions to the extent of informing the estimates of $(\alpha, B, \pi_{1:N})$. In fact, when $\rho$ is close to 1 the 
most likely explanation for non-interactions is generic noise, i.e., zeros are likely to be generated from the 
point mass. 
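
As a quick consistency check of the two expressions above, the sketch below computes the rescaled block matrix $B'$ and the non-interaction probability for a pair of indicators pointing at complexes $g$ and $h$, and confirms that the two probabilities add up to one. The values of $B$ and $\rho$ and the function names are hypothetical.

```python
def sparse_block_matrix(B, rho):
    """Return B' with B'(g, h) = (1 - rho) * B(g, h)."""
    return [[(1.0 - rho) * b for b in row] for row in B]


def non_interaction_probability(B, rho, g, h):
    """Mixture of the original Bernoulli 'failure' and a point mass at zero."""
    return (1.0 - rho) * (1.0 - B[g][h]) + rho


# Hypothetical toy values: K = 2 complexes, three quarters of the zeros are noise.
B = [[0.8, 0.1],
     [0.1, 0.6]]
rho = 0.75

B_prime = sparse_block_matrix(B, rho)
checks = [
    B_prime[g][h] + non_interaction_probability(B, rho, g, h)
    for g in range(2) for h in range(2)
]
print(B_prime[0][0], checks)
```

With these toy values an interaction between two members of complex 1 has probability 0.2 rather than 0.8, so most observed zeros are attributed to the point mass and carry little information about memberships.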

3.5 Parameter Estimation and Posterior Inference 

We develop estimation strategies for the hyper-parameters $(\alpha, B)$ within the empirical Bayes framework 
(Morris, 1983; Carlin and Louis, 2005). We develop a variational approximation to Expectation-Maximization 
(EM) to carry out posterior inference for the latent mixed-membership vectors, $\pi_{1:N}$. A description of the 
algorithm and the mathematical derivations are presented elsewhere (Airoldi et al., 2007). The optimal 
number of blocks, $K$, is selected via cross-validation on the held-out likelihood. 

Briefly, in order to estimate $(\alpha, B)$ and infer posterior distributions for $\pi_{1:N}$ we need to be able to 
evaluate the likelihood, which involves the intractable integral in Equation 1. Given the large amount 
of data available about biological (e.g., protein) networks, approximate posterior inference strategies are 
considered in the context of variational methods, a computationally cheaper alternative to Markov chain 
Monte Carlo methods. Using variational methods, we find a tractable lower bound for the likelihood that 
can be used as a surrogate for inference purposes. This leads to approximate MLEs for the hyper-parameters 
and approximate posterior distributions for the (latent) mixed-membership vectors. 
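
For orientation, the generic form of such a lower bound follows from Jensen's inequality. The statement below is the standard variational bound written in the notation of Equation 1, with $q$ denoting an arbitrary distribution over the latent variables; the specific mean-field family and the resulting update equations used for ALB are given in Airoldi et al. (2007) and are not reproduced here.

```latex
\log p(R_{1:M} \mid \alpha, B)
  \;\ge\;
  \mathbb{E}_q\!\left[ \log p(R_{1:M}, \pi_{1:N}, Z^{\to}_{1:M}, Z^{\leftarrow}_{1:M} \mid \alpha, B) \right]
  \;-\;
  \mathbb{E}_q\!\left[ \log q(\pi_{1:N}, Z^{\to}_{1:M}, Z^{\leftarrow}_{1:M}) \right]
```

The bound is tight when $q$ equals the true posterior of the latent variables; restricting $q$ to a tractable family trades tightness for computability.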

We introduce a variant of variational EM for our model, which we term the nested variational EM algorithm. 
Our algorithm improves on naive variational EM in two respects: (i) it is parallelizable when applied to 
relational data; and (ii) it reduces memory requirements from $(NK + N^2 K)$ to $(NK + K)$ per iteration. 
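
To give a sense of scale, the sketch below plugs the size of the case study into the two expressions, reading them as counts of stored numbers (an assumption on our part; the paper does not state units). It uses $N = 871$ proteins and, for illustration, $K = 50$ groups as selected in Section 4.2.

```python
def naive_memory(n, k):
    """Number of stored quantities for naive variational EM: NK + N^2 K."""
    return n * k + n * n * k


def nested_memory(n, k):
    """Number of stored quantities for nested variational EM: NK + K."""
    return n * k + k


N, K = 871, 50  # proteins in the MIPS subset, groups in the selected model
naive_count = naive_memory(N, K)
nested_count = nested_memory(N, K)
print(naive_count, nested_count, naive_count // nested_count)
```

Since $NK + N^2K = NK(N + 1)$ and $NK + K = K(N + 1)$, the ratio of the two counts is exactly $N$, so the saving grows linearly with the number of proteins.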

4 Application to Protein Interactions in Saccharomyces Cerevisiae 

Protein-protein interactions (PPI) form the physical basis for the formation of complexes and pathways that 
carry out different biological processes. A number of high-throughput experimental approaches have been 
applied to determine the set of interacting proteins on a proteome-wide scale in yeast. These include the 
two-hybrid (Y2H) screens and mass spectrometry methods. Mass spectrometry can be used to identify 
components of protein complexes (Gavin et al., 2002; Ho et al., 2002). 

High-throughput methods, though, may miss complexes that are not present under the given conditions. 
For example, tagging may disturb complex formation and weakly associated components may dissociate and 
escape detection. Statistical models that encode information about functional processes with high precision 
are an essential tool for carrying out probabilistic de-noising of biological signals from high-throughput 
experiments. 

Our goal is to identify the proteins' diverse functional roles by analyzing their local and global patterns 
of interaction via ALB. The biochemical composition of individual proteins makes them suitable for carrying 
out a specific set of cellular operations, or functions. Proteins typically carry out these functions as part of 
stable protein complexes (Krogan et al., 2006). There are many situations in which proteins are believed to 
interact (Alberts et al., 2002). The main intuition behind our methodology is that pairs of proteins interact 
because they are part of the same stable protein complex, i.e., co-location, or because they are part of 
interacting protein complexes as they carry out compatible cellular operations. 



In previous work, we established the usefulness of an admixture of latent blockmodels for analyzing 
protein-protein interaction data (Airoldi et al., 2005). For example, we used the ALB for testing functional 
interaction hypotheses (by setting a null hypothesis for $B$), and for unsupervised estimation experiments. In the next 
section, we assess whether, and how much, functionally relevant biological signal can be captured by the 
ALB. 

In summary, the results in Airoldi et al. (2005) show that the ALB identifies protein complexes whose 
member proteins are tightly interacting with one another. The identifiable protein complexes correlate with 
the following four categories of Table 1: cell cycle & DNA processing, transcription, protein synthesis, and 
sub-cellular activities. The high correlation of inferred protein complexes can be leveraged for predicting the 
presence or absence of functional annotations, for example, by using a logistic regression. However, there 
is not enough signal in the data to independently predict annotations in other functional categories. The 
empirical Bayes estimates of the hyper-parameters that support these conclusions in the various types of 
analyses are consistent: $\alpha < 1$ and small, and $B$ nearly block diagonal with two positive blocks comprising 
the four identifiable protein complexes. In these previous analyses, we fixed the number of latent protein 
complexes to 15, the number of broad functional categories in Table 1. 

Figure 3: We estimate the mapping of latent groups to functions. The two plots show the marginal frequencies 
of membership of proteins to true functions (bottom) and to identified functions (top), in the cross-validation 
experiment. The mapping is selected to maximize the accuracy of the predictions on the training 
set, in the cross-validation experiment, and to minimize the divergence between marginal true and predicted 
frequencies if no training data is available; see Section 4.1. 

Figure 4: Predicted mixed-membership probabilities (dashed, red lines) versus binary manually curated 
functional annotations (solid, black lines) for 6 example proteins. The identification of latent groups with 
functions is estimated as shown in Figure 3. 

The latent protein complexes are not a-priori identifiable in our model. To resolve this, we estimated a 
mapping between latent complexes and functions by minimizing the divergence between true and predicted 
marginal frequencies of membership, where the truth was evaluated on a small fraction of the interactions. 
We used this mapping to compare predicted versus known functional annotations for all proteins. The 
best estimated mapping is shown in the left panel of Figure 3, along with the marginal latent category 
membership, and it is compared to the 15 broad functional categories in Table 1, along with the known category 
membership (in the MIPS database), in the right panel. Figure 4 displays a few examples of predicted 
mixed membership probabilities against the true annotations, given the estimated mapping of latent protein 
complexes to functional categories. 
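
One way to picture the mapping step is as a search over one-to-one assignments of latent groups to functions that minimizes a divergence between the two sets of marginal frequencies. The sketch below is a toy illustration under our own assumptions: it uses the Kullback-Leibler divergence and a brute-force search over permutations, neither of which is specified in the text, and exhaustive search is only feasible for a handful of groups (not for $K = 15$).

```python
import itertools
import math


def normalize(values):
    """Rescale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def best_mapping(true_freq, latent_freq):
    """Return (perm, divergence), where perm[f] is the latent group mapped to function f."""
    target = normalize(true_freq)
    best_perm, best_div = None, float("inf")
    for perm in itertools.permutations(range(len(latent_freq))):
        mapped = normalize([latent_freq[g] for g in perm])
        div = kl_divergence(target, mapped)
        if div < best_div:
            best_perm, best_div = perm, div
    return best_perm, best_div


# Hypothetical marginal frequencies for 3 functions and 3 latent groups.
true_freq = [0.5, 0.3, 0.2]
latent_freq = [0.2, 0.5, 0.3]
perm, div = best_mapping(true_freq, latent_freq)
print(perm, div)
```

In this toy case the latent frequencies are a relabeling of the true ones, so the search recovers the relabeling with zero divergence; with real estimates the minimum would be positive and ties between groups of similar size would make the mapping less certain.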

4.2 Measuring the Functional Content in the Posterior 

In a follow-up study we considered the gene ontology (GO) (Ashburner et al., 2000) as the source of functional 
annotations to consider as ground truth in our analyses. GO is a broader and finer grained functional 
annotation scheme compared to that produced by the Munich Information Center for Protein Sequences. Furthermore, 
we explored a much larger model space than in the previous study, in order to test to what extent ALB 
can reduce the dimensionality of the data while revealing substantive information about the functionality of 
proteins that can be used to inform subsequent analyses. We fit models with a number of blocks up to $K = 225$. 
Thanks to our nested variational inference algorithm, we were able to perform five-fold cross-validation for 
each value of $K$. We determined that a fairly parsimonious model ($K^* = 50$) provides a good description 
of the observed protein interaction network. This fact is (qualitatively) consistent with the quality of the 
predictions that were obtained with a parsimonious model ($K = 15$) in the previous section, in a different 
setting. This finding supports the hypothesis that groups of interacting proteins in the MIPS data set encode 
biological signal at a scale of aggregation that is higher than that of protein complexes. 6 

6 It has been recently suggested that stable protein complexes average five proteins in size (Krogan et al., 2006). Thus, if ALB 
captured biological signal at the protein-complex resolution, we would expect the optimal number of groups to be much higher. 
(Disregarding mixed membership, $871/5 \approx 175$.) 

We settled on a model with $K^* = 50$ blocks. To evaluate the functional content of the interactions 
predicted by such a model, we first computed the posterior probabilities of interactions by thresholding the 
posterior expectations 

$$E[R_m(p, q) = 1] \approx \hat{\pi}_p^\top \hat{B} \, \hat{\pi}_q \quad \text{and} \quad E[R_m(p, q) = 1] \approx \hat{\phi}^{m\top}_{p \to q} \hat{B} \, \hat{\phi}^m_{p \leftarrow q},$$

and we then computed the precision-recall curves corresponding to these predictions. These curves are 
shown in Figure 5 as the light blue (-x) line and the dark blue (-+) line. In Figure 5 we also plotted 
the functional content of the original MIPS collection. This plot confirms that the MIPS collection of 
interactions, our data, is one of the most precise (the Y axis measures precision) and most extensive (the X 
axis measures the amount of functional annotations predicted, a measure of recall) sources of biologically 
relevant interactions available to date; see the yellow diamond, point # 2. The posterior means of $(\pi_{1:N})$ and 
the estimates of $(\alpha, B)$ provide a parsimonious representation for the MIPS collection, and lead to precise 
interaction estimates, in moderate amount (the light blue, -x line). The posterior means of $(Z^{\to}, Z^{\leftarrow})$ 
provide a richer representation for the data, and describe most of the functional content of the MIPS collection 
with high precision (the dark blue, -+ line). Most importantly, notice that the estimated protein interaction 
networks, i.e., pluses and crosses, corresponding to lower levels of recall feature a more precise functional 
content than the original. This means that the proposed latent block structure is helpful in summarizing 
the collection of interactions by ranking them properly. (It also happens that dense blocks of predicted 
interactions contain known functional predictions that were not in the MIPS collection.) Table 2 provides 
more information about three instances of predicted interaction networks displayed in Figure 5; namely, 
those corresponding to the points annotated with the numbers 1 (a collection of interactions predicted with the 
$\pi$'s), 2 (the original MIPS collection of interactions), and 3 (a collection of interactions predicted with the 
$\phi$'s). Specifically, the table shows a breakdown of the predicted (posterior) collections of interactions in 
each example network into the gene ontology categories. A count in the second-to-last column of Table 2 
corresponds to the fact that both proteins are annotated with the same GO functional category. 7 Figure 6 
investigates the correlations between the data sets (in rows) we considered in Figure 5 and a few gene ontology 
categories (in columns). The intensity of the square (red is high) measures the area under the precision-recall 
curve (Myers et al., 2006). 

Figure 5: In the top panel we measure the functional content of the MIPS collection of protein interactions 
(yellow diamond), and compare it against other published collections of interactions and microarray 
data, and to the posterior estimates of the ALB models, computed as described in Section 4.2. A breakdown 
of three estimated interaction networks (the points annotated 1, 2, and 3) into most represented gene 
ontology categories is detailed in Table 2. 

Table 2: Breakdown of three example interaction networks into most represented gene ontology categories; 
see text for more details. The digit in the first column indicates the example network in Figure 5 that any 
given line refers to. The last two columns quote the number of predicted, and possible pairs for each GO 
term. 
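
The thresholding evaluation described in this section can be illustrated with a small sketch. It uses the textbook definitions of precision and recall on a set of scored protein pairs; the scores, the protein names, and the set of "functionally relevant" pairs are hypothetical, and the exact recall measure used for Figure 5 (following Myers et al., 2006) may differ from this simplified version.

```python
def precision_recall(scores, relevant, threshold):
    """Precision and recall of the pairs whose score reaches the threshold.

    scores:   dict mapping a protein pair to its posterior expectation
    relevant: set of pairs treated as functionally relevant (ground truth)
    """
    predicted = {pair for pair, s in scores.items() if s >= threshold}
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical posterior expectations for four protein pairs.
scores = {("A", "B"): 0.9, ("A", "C"): 0.6, ("B", "C"): 0.2, ("C", "D"): 0.7}
relevant = {("A", "B"), ("C", "D"), ("B", "C")}

loose = precision_recall(scores, relevant, threshold=0.5)
strict = precision_recall(scores, relevant, threshold=0.8)
print(loose, strict)
```

Sweeping the threshold from high to low traces out a precision-recall curve; in this toy example the stricter threshold keeps fewer pairs but all of them are relevant, which mirrors the observation that predicted networks at lower recall can be more precise.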

7 Note that, in GO, proteins are typically annotated to multiple functional categories. 

In this application, the ALB learned information about (i) the mixed membership of objects to latent 
groups, and (ii) the connectivity patterns among latent groups. These quantities were useful in describing 
and summarizing the functional content of the MIPS collection of protein interactions. This suggests the use 
of ALB as a dimensionality reduction approach that may be useful for performing model-driven de-noising 
of new collections of interactions, such as those measured via high-throughput experiments. 

5 Conclusions 

When applied to a sample of measurements on pairs of objects, Admixture of Latent Blocks simultaneously 
extracts information about (i) the mixed membership of objects to latent aspects, and (ii) the connectivity 
patterns among latent aspects, using a nested variational EM algorithm. 

We found it useful in describing and summarizing the functional content of a protein interaction network, 
and we envision its use for de-noising new collections of interactions from high-throughput experiments. 